The institutionalized Standardized Testing of New York State has inaccurately reflected students’ mental capacities. This project explores whether or not there is a correlation between the New York Statewide English Language Arts and Math Exam results and student ethnicity across all school districts. In this project, the ethnic composition of test results will be explored to expose the inaccuracy and unfairness of standardized testing.
I examined the top scoring schools in NYC and took a closer look at their ethnic compositions. A correlation ranker revealed that as the population of certain ethnicities (typically non-minority groups such as White or Asian students) increased, so too did the mean scale score. The opposite also applies where as the population of certain ethnicities (typically minority groups such as Black or Hispanic students) decreased, so too did the mean scale score.
score_map
This project uses data from OpenData NYC on New York State Standardized Testing results from 2013-2018.
The ELA and Math statewide tests evaluate a student beginning from 3rd grade until 8th grade on their ability to solve Common Core questions. Students are evaluated on a scale from 1-4, 1 being the lowest and 4 being the highest. These datasets provide information on schools in New York City and their reported test scores. The information includes the percentage and number of students who received a certain exam score. Information is provided for every grade at the listed school.
This dataset provides information on the demographic snapshot of the same schools listed in the statewide exam result datasets. The information provided includes the total number of students in each grade, gender composition, ethnic composition and economic composition of students. Only the ethnic composition data is utilized in this project as the ratio of female to male remained around an even split of 50% and the timespan of the project would not allow for an in-depth exploration of the economic composition in relation to the exam scores.
This dataset was used to create a map of the school locations.
To explore the datasets and test if there existed an ethnic bias in standardized testing, I utilized my knowledge of pandas to effectively clean and gather data. With my previous experience in json files, I was able to get data from the OpenData NYC website without downloading csv files. I used a correlation ranker and scatter plots to explore if there existed a correlation between ethnicity and exam scores.
https://towardsdatascience.com/choropleth-maps-with-folium-1a5b8bcdd392
https://opendata.cityofnewyork.us/projects/introduction-to-choropleth-maps/
https://www.kaggle.com/daveianhickey/how-to-folium-for-maps-heatmaps-time-data/notebook
Datasets:
https://data.cityofnewyork.us/Education/2013-2019-English-Language-Arts-ELA-Test-Results-S/gu76-8i7h
https://data.cityofnewyork.us/Education/2013-2019-Math-Test-Results-School-SWD-Ethnicity-G/74ah-8ukf
https://data.cityofnewyork.us/Education/2013-2018-Demographic-Snapshot-School/s52a-8aq6
https://data.cityofnewyork.us/Education/2017-2018-School-Locations/p6h4-mpyy
import pandas as pd
import pandasql as psql
import requests
import json
import folium
from folium import plugins
import matplotlib.pyplot as plt
%matplotlib inline
This project explores data on New York Statewide English Language Arts and Math exams from the 2012-13 to 2017-18 school years. In the 2018-19 school year, the statewide exam was changed from three days of testing to two. Thus, data from the 2018-19 school year was omitted. The project also utilizes geographical data to create a mapped visualization of the correlation between exam results and ethnicity.
The NY Statewide ELA and Math Exams are graded on a scale of 1-4 where:
res = requests.get('https://data.cityofnewyork.us/resource/gu76-8i7h.json?$limit=32826')
cols = {'dbn': 'DBN', 'school_name':'School Name', 'grade': 'Grade', 'year':'Year', \
'number_tested':'Number Tested', 'mean_scale_score':'Mean Scale Score','level_1':'# Level 1', 'level_1_1':'% Level 1', \
'level_2':'# Level 2', 'level_2_1':'% Level 2', 'level_3':'# Level 3', 'level_3_1':'% Level 3', \
'level_4':'# Level 4', 'level_4_1':'% Level 4', 'level_3_4': '# Level 3+4', 'level_3_4_1':'% Level 3+4'}
# Extract columns from json and rename the columns
ela_df = pd.DataFrame(res.json(), columns = cols).rename(columns = cols)
ela_df = ela_df.loc[ela_df['Mean Scale Score'] != 's'] # there exists rows without number values
ela_df = ela_df.loc[ela_df['Grade'] != 'All Grades']
ela_df
| DBN | School Name | Grade | Year | Number Tested | Mean Scale Score | # Level 1 | % Level 1 | # Level 2 | % Level 2 | # Level 3 | % Level 3 | # Level 4 | % Level 4 | # Level 3+4 | % Level 3+4 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 01M015 | P.S. 015 ROBERTO CLEMENTE | 3 | 2013 | 27 | 289 | 14 | 51.9 | 11 | 40.7 | 2 | 7.4 | 0 | 0.0 | 2 | 7.4 |
| 1 | 01M015 | P.S. 015 ROBERTO CLEMENTE | 3 | 2014 | 18 | 285 | 10 | 55.6 | 8 | 44.4 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 |
| 2 | 01M015 | P.S. 015 ROBERTO CLEMENTE | 3 | 2015 | 16 | 282 | 9 | 56.3 | 5 | 31.3 | 2 | 12.5 | 0 | 0.0 | 2 | 12.5 |
| 3 | 01M015 | P.S. 015 ROBERTO CLEMENTE | 3 | 2016 | 20 | 293 | 10 | 50.0 | 6 | 30.0 | 4 | 20.0 | 0 | 0.0 | 4 | 20.0 |
| 4 | 01M015 | P.S. 015 ROBERTO CLEMENTE | 3 | 2017 | 27 | 302 | 10 | 37.0 | 8 | 29.6 | 7 | 25.9 | 2 | 7.4 | 9 | 33.3 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 32814 | 32K562 | EVERGREEN MIDDLE SCHOOL FOR URBAN EXPLORATION | 8 | 2015 | 124 | 285 | 52 | 41.9 | 60 | 48.4 | 10 | 8.1 | 2 | 1.6 | 12 | 9.7 |
| 32815 | 32K562 | EVERGREEN MIDDLE SCHOOL FOR URBAN EXPLORATION | 8 | 2016 | 92 | 291 | 27 | 29.3 | 53 | 57.6 | 11 | 12.0 | 1 | 1.1 | 12 | 13.0 |
| 32816 | 32K562 | EVERGREEN MIDDLE SCHOOL FOR URBAN EXPLORATION | 8 | 2017 | 85 | 291 | 26 | 30.6 | 44 | 51.8 | 12 | 14.1 | 3 | 3.5 | 15 | 17.6 |
| 32817 | 32K562 | EVERGREEN MIDDLE SCHOOL FOR URBAN EXPLORATION | 8 | 2018 | 90 | 591 | 24 | 26.7 | 48 | 53.3 | 16 | 17.8 | 2 | 2.2 | 18 | 20.0 |
| 32818 | 32K562 | EVERGREEN MIDDLE SCHOOL FOR URBAN EXPLORATION | 8 | 2019 | 109 | 596 | 16 | 14.7 | 57 | 52.3 | 28 | 25.7 | 8 | 7.3 | 36 | 33.0 |
25205 rows × 16 columns
Data obtained from OpenData NYC: https://data.cityofnewyork.us/Education/2013-2019-English-Language-Arts-ELA-Test-Results-S/gu76-8i7h
res = requests.get('https://data.cityofnewyork.us/resource/74ah-8ukf.json?$limit=32826')
# Extract columns from json and rename the columns (same columns as ela_df)
math_df = pd.DataFrame(res.json(), columns = cols).rename(columns = cols)
math_df = math_df.loc[math_df['Mean Scale Score'] != 's'] # there exists rows without number values
math_df = math_df.loc[math_df['Grade'] != 'All Grades']
math_df
| DBN | School Name | Grade | Year | Number Tested | Mean Scale Score | # Level 1 | % Level 1 | # Level 2 | % Level 2 | # Level 3 | % Level 3 | # Level 4 | % Level 4 | # Level 3+4 | % Level 3+4 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 01M015 | P.S. 015 ROBERTO CLEMENTE | 3 | 2013 | 27 | 278 | 16 | 59.3 | 11 | 40.7 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 |
| 1 | 01M015 | P.S. 015 ROBERTO CLEMENTE | 3 | 2014 | 18 | 286 | 6 | 33.3 | 9 | 50.0 | 2 | 11.1 | 1 | 5.6 | 3 | 16.7 |
| 2 | 01M015 | P.S. 015 ROBERTO CLEMENTE | 3 | 2015 | 17 | 280 | 10 | 58.8 | 4 | 23.5 | 2 | 11.8 | 1 | 5.9 | 3 | 17.6 |
| 3 | 01M015 | P.S. 015 ROBERTO CLEMENTE | 3 | 2016 | 21 | 275 | 13 | 61.9 | 4 | 19.0 | 4 | 19.0 | 0 | 0.0 | 4 | 19.0 |
| 4 | 01M015 | P.S. 015 ROBERTO CLEMENTE | 3 | 2017 | 29 | 302 | 8 | 27.6 | 9 | 31.0 | 7 | 24.1 | 5 | 17.2 | 12 | 41.4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 32594 | 32K562 | EVERGREEN MIDDLE SCHOOL FOR URBAN EXPLORATION | 7 | 2019 | 116 | 590 | 57 | 49.1 | 39 | 33.6 | 19 | 16.4 | 1 | 0.9 | 20 | 17.2 |
| 32595 | 32K562 | EVERGREEN MIDDLE SCHOOL FOR URBAN EXPLORATION | 8 | 2015 | 128 | 268 | 97 | 75.8 | 31 | 24.2 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 |
| 32596 | 32K562 | EVERGREEN MIDDLE SCHOOL FOR URBAN EXPLORATION | 8 | 2016 | 91 | 274 | 55 | 60.4 | 33 | 36.3 | 3 | 3.3 | 0 | 0.0 | 3 | 3.3 |
| 32597 | 32K562 | EVERGREEN MIDDLE SCHOOL FOR URBAN EXPLORATION | 8 | 2017 | 88 | 268 | 59 | 67.0 | 25 | 28.4 | 3 | 3.4 | 1 | 1.1 | 4 | 4.5 |
| 32598 | 32K562 | EVERGREEN MIDDLE SCHOOL FOR URBAN EXPLORATION | 8 | 2019 | 108 | 595 | 53 | 49.1 | 34 | 31.5 | 16 | 14.8 | 5 | 4.6 | 21 | 19.4 |
24953 rows × 16 columns
Data obtained from OpenData NYC: https://data.cityofnewyork.us/Education/2013-2019-Math-Test-Results-School-SWD-Ethnicity-G/74ah-8ukf
To effectively work with these number values, we need to convert columns to the appropriate datatypes. This applies to both math and ela dataframes.
ela_df.dtypes
DBN object School Name object Grade object Year object Number Tested object Mean Scale Score object # Level 1 object % Level 1 object # Level 2 object % Level 2 object # Level 3 object % Level 3 object # Level 4 object % Level 4 object # Level 3+4 object % Level 3+4 object dtype: object
ela_df['Grade'] = ela_df['Grade'].astype(int)
math_df['Grade'] = math_df['Grade'].astype(int)
cols = ['Number Tested', 'Mean Scale Score', '# Level 1','% Level 1', '# Level 2','% Level 2',\
'# Level 3','% Level 3', '# Level 4','% Level 4', '# Level 3+4', '% Level 3+4']
for col in cols:
ela_df[col] = ela_df[col].astype(float)
math_df[col] = math_df[col].astype(float)
ela_df.dtypes
DBN object School Name object Grade int64 Year object Number Tested float64 Mean Scale Score float64 # Level 1 float64 % Level 1 float64 # Level 2 float64 % Level 2 float64 # Level 3 float64 % Level 3 float64 # Level 4 float64 % Level 4 float64 # Level 3+4 float64 % Level 3+4 float64 dtype: object
res = requests.get('https://data.cityofnewyork.us/resource/s52a-8aq6.json?$limit=8972')
cols = {'dbn':'DBN', 'school_name':'School Name', 'year':'Year', 'total_enrollment':'Total Enrollment', \
'grade_3':'Grade 3', 'grade_4':'Grade 4','grade_5':'Grade 5','grade_6':'Grade 6','grade_7':'Grade 7', \
'grade_8':'Grade 8','asian_1':'# Asian', 'asian_2':'% Asian', 'black_1':'# Black', 'black_2':'% Black', \
'hispanic_1':'# Hispanic', 'hispanic_2':'% Hispanic', 'multiple_race_categories_not_represented_1':'# Other',\
'multiple_race_categories_not_represented_2':'% Other', 'white_1':'# White', 'white_2':'% White'}
#Extract columns from json and rename the columns to easily distinguish # and %
demographics_df = pd.DataFrame(res.json(), columns = cols).rename(columns=cols)
demographics_df
| DBN | School Name | Year | Total Enrollment | Grade 3 | Grade 4 | Grade 5 | Grade 6 | Grade 7 | Grade 8 | # Asian | % Asian | # Black | % Black | # Hispanic | % Hispanic | # Other | % Other | # White | % White | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 01M015 | P.S. 015 Roberto Clemente | 2013-14 | 190 | 16 | 26 | 23 | 0 | 0 | 0 | 9 | 4.7 | 72 | 37.9 | 104 | 54.7 | 2 | 1.1 | 3 | 1.6 |
| 1 | 01M015 | P.S. 015 Roberto Clemente | 2014-15 | 183 | 19 | 17 | 24 | 0 | 0 | 0 | 8 | 4.4 | 65 | 35.5 | 107 | 58.5 | 1 | 0.5 | 2 | 1.1 |
| 2 | 01M015 | P.S. 015 Roberto Clemente | 2015-16 | 176 | 23 | 17 | 18 | 0 | 0 | 0 | 9 | 5.1 | 57 | 32.4 | 105 | 59.7 | 3 | 1.7 | 2 | 1.1 |
| 3 | 01M015 | P.S. 015 Roberto Clemente | 2016-17 | 178 | 31 | 24 | 18 | 0 | 0 | 0 | 14 | 7.9 | 51 | 28.7 | 105 | 59.0 | 4 | 2.2 | 4 | 2.2 |
| 4 | 01M015 | P.S. 015 Roberto Clemente | 2017-18 | 190 | 23 | 31 | 26 | 0 | 0 | 0 | 20 | 10.5 | 52 | 27.4 | 110 | 57.9 | 2 | 1.1 | 6 | 3.2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8967 | 84X730 | Bronx Charter School for the Arts | 2013-14 | 319 | 56 | 53 | 48 | 0 | 0 | 0 | 1 | 0.3 | 97 | 30.4 | 212 | 66.5 | 6 | 1.9 | 3 | 0.9 |
| 8968 | 84X730 | Bronx Charter School for the Arts | 2014-15 | 316 | 53 | 47 | 52 | 0 | 0 | 0 | 2 | 0.6 | 93 | 29.4 | 214 | 67.7 | 5 | 1.6 | 2 | 0.6 |
| 8969 | 84X730 | Bronx Charter School for the Arts | 2015-16 | 323 | 55 | 57 | 48 | 0 | 0 | 0 | 2 | 0.6 | 92 | 28.5 | 218 | 67.5 | 6 | 1.9 | 5 | 1.5 |
| 8970 | 84X730 | Bronx Charter School for the Arts | 2016-17 | 320 | 52 | 53 | 52 | 2 | 0 | 0 | 2 | 0.6 | 76 | 23.7 | 236 | 73.8 | 3 | 0.9 | 3 | 0.9 |
| 8971 | 84X730 | Bronx Charter School for the Arts | 2017-18 | 314 | 51 | 52 | 54 | 2 | 0 | 0 | 2 | 0.6 | 65 | 20.7 | 243 | 77.4 | 3 | 1.0 | 1 | 0.3 |
8972 rows × 20 columns
Data obtained from OpenData NYC: https://data.cityofnewyork.us/Education/2013-2018-Demographic-Snapshot-School/s52a-8aq6
I utilized apply() to extract the district for all schools for the ELA, Math, and Demographic dataframes. This will be used to visualize by district.
"""
@param name: string of public school code formatted as "01M015"
@return: district extracted from public school code
"""
def extractDistrict(name):
return int(name[:2])
ela_df.insert(0, 'District', ela_df['DBN'].apply(extractDistrict))
ela_df.head()
| District | DBN | School Name | Grade | Year | Number Tested | Mean Scale Score | # Level 1 | % Level 1 | # Level 2 | % Level 2 | # Level 3 | % Level 3 | # Level 4 | % Level 4 | # Level 3+4 | % Level 3+4 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 01M015 | P.S. 015 ROBERTO CLEMENTE | 3 | 2013 | 27.0 | 289.0 | 14.0 | 51.9 | 11.0 | 40.7 | 2.0 | 7.4 | 0.0 | 0.0 | 2.0 | 7.4 |
| 1 | 1 | 01M015 | P.S. 015 ROBERTO CLEMENTE | 3 | 2014 | 18.0 | 285.0 | 10.0 | 55.6 | 8.0 | 44.4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 1 | 01M015 | P.S. 015 ROBERTO CLEMENTE | 3 | 2015 | 16.0 | 282.0 | 9.0 | 56.3 | 5.0 | 31.3 | 2.0 | 12.5 | 0.0 | 0.0 | 2.0 | 12.5 |
| 3 | 1 | 01M015 | P.S. 015 ROBERTO CLEMENTE | 3 | 2016 | 20.0 | 293.0 | 10.0 | 50.0 | 6.0 | 30.0 | 4.0 | 20.0 | 0.0 | 0.0 | 4.0 | 20.0 |
| 4 | 1 | 01M015 | P.S. 015 ROBERTO CLEMENTE | 3 | 2017 | 27.0 | 302.0 | 10.0 | 37.0 | 8.0 | 29.6 | 7.0 | 25.9 | 2.0 | 7.4 | 9.0 | 33.3 |
math_df.insert(0, 'District', math_df['DBN'].apply(extractDistrict))
math_df.head()
| District | DBN | School Name | Grade | Year | Number Tested | Mean Scale Score | # Level 1 | % Level 1 | # Level 2 | % Level 2 | # Level 3 | % Level 3 | # Level 4 | % Level 4 | # Level 3+4 | % Level 3+4 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 01M015 | P.S. 015 ROBERTO CLEMENTE | 3 | 2013 | 27.0 | 278.0 | 16.0 | 59.3 | 11.0 | 40.7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1 | 01M015 | P.S. 015 ROBERTO CLEMENTE | 3 | 2014 | 18.0 | 286.0 | 6.0 | 33.3 | 9.0 | 50.0 | 2.0 | 11.1 | 1.0 | 5.6 | 3.0 | 16.7 |
| 2 | 1 | 01M015 | P.S. 015 ROBERTO CLEMENTE | 3 | 2015 | 17.0 | 280.0 | 10.0 | 58.8 | 4.0 | 23.5 | 2.0 | 11.8 | 1.0 | 5.9 | 3.0 | 17.6 |
| 3 | 1 | 01M015 | P.S. 015 ROBERTO CLEMENTE | 3 | 2016 | 21.0 | 275.0 | 13.0 | 61.9 | 4.0 | 19.0 | 4.0 | 19.0 | 0.0 | 0.0 | 4.0 | 19.0 |
| 4 | 1 | 01M015 | P.S. 015 ROBERTO CLEMENTE | 3 | 2017 | 29.0 | 302.0 | 8.0 | 27.6 | 9.0 | 31.0 | 7.0 | 24.1 | 5.0 | 17.2 | 12.0 | 41.4 |
demographics_df.insert(0, 'District', demographics_df['DBN'].apply(extractDistrict))
demographics_df
| District | DBN | School Name | Year | Total Enrollment | Grade 3 | Grade 4 | Grade 5 | Grade 6 | Grade 7 | ... | # Asian | % Asian | # Black | % Black | # Hispanic | % Hispanic | # Other | % Other | # White | % White | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 01M015 | P.S. 015 Roberto Clemente | 2013-14 | 190 | 16 | 26 | 23 | 0 | 0 | ... | 9 | 4.7 | 72 | 37.9 | 104 | 54.7 | 2 | 1.1 | 3 | 1.6 |
| 1 | 1 | 01M015 | P.S. 015 Roberto Clemente | 2014-15 | 183 | 19 | 17 | 24 | 0 | 0 | ... | 8 | 4.4 | 65 | 35.5 | 107 | 58.5 | 1 | 0.5 | 2 | 1.1 |
| 2 | 1 | 01M015 | P.S. 015 Roberto Clemente | 2015-16 | 176 | 23 | 17 | 18 | 0 | 0 | ... | 9 | 5.1 | 57 | 32.4 | 105 | 59.7 | 3 | 1.7 | 2 | 1.1 |
| 3 | 1 | 01M015 | P.S. 015 Roberto Clemente | 2016-17 | 178 | 31 | 24 | 18 | 0 | 0 | ... | 14 | 7.9 | 51 | 28.7 | 105 | 59.0 | 4 | 2.2 | 4 | 2.2 |
| 4 | 1 | 01M015 | P.S. 015 Roberto Clemente | 2017-18 | 190 | 23 | 31 | 26 | 0 | 0 | ... | 20 | 10.5 | 52 | 27.4 | 110 | 57.9 | 2 | 1.1 | 6 | 3.2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8967 | 84 | 84X730 | Bronx Charter School for the Arts | 2013-14 | 319 | 56 | 53 | 48 | 0 | 0 | ... | 1 | 0.3 | 97 | 30.4 | 212 | 66.5 | 6 | 1.9 | 3 | 0.9 |
| 8968 | 84 | 84X730 | Bronx Charter School for the Arts | 2014-15 | 316 | 53 | 47 | 52 | 0 | 0 | ... | 2 | 0.6 | 93 | 29.4 | 214 | 67.7 | 5 | 1.6 | 2 | 0.6 |
| 8969 | 84 | 84X730 | Bronx Charter School for the Arts | 2015-16 | 323 | 55 | 57 | 48 | 0 | 0 | ... | 2 | 0.6 | 92 | 28.5 | 218 | 67.5 | 6 | 1.9 | 5 | 1.5 |
| 8970 | 84 | 84X730 | Bronx Charter School for the Arts | 2016-17 | 320 | 52 | 53 | 52 | 2 | 0 | ... | 2 | 0.6 | 76 | 23.7 | 236 | 73.8 | 3 | 0.9 | 3 | 0.9 |
| 8971 | 84 | 84X730 | Bronx Charter School for the Arts | 2017-18 | 314 | 51 | 52 | 54 | 2 | 0 | ... | 2 | 0.6 | 65 | 20.7 | 243 | 77.4 | 3 | 1.0 | 1 | 0.3 |
8972 rows × 21 columns
res = requests.get('https://data.cityofnewyork.us/resource/p6h4-mpyy.json?$limit=1823')
cols = {"geographical_district_code":"District", "ats_system_code":"DBN", "location_name":"School Name", "location_1":"Location"}
location_df = pd.DataFrame(res.json(), columns=cols).rename(columns=cols)
location_df.head()
| District | DBN | School Name | Location | |
|---|---|---|---|---|
| 0 | 1 | 01M015 | P.S. 015 Roberto Clemente | {'latitude': '40.722075', 'longitude': '-73.97... |
| 1 | 1 | 01M019 | P.S. 019 Asher Levy | {'latitude': '40.730009', 'longitude': '-73.98... |
| 2 | 1 | 01M020 | P.S. 020 Anna Silver | {'latitude': '40.721305', 'longitude': '-73.98... |
| 3 | 1 | 01M034 | P.S. 034 Franklin D. Roosevelt | {'latitude': '40.726008', 'longitude': '-73.97... |
| 4 | 1 | 01M063 | The STAR Academy - P.S.63 | {'latitude': '40.72444', 'longitude': '-73.986... |
Data obtained from OpenData NYC: https://data.cityofnewyork.us/Education/2017-2018-School-Locations/p6h4-mpyy
After cleaning the data to extract only the necessary columns and renaming them to consist with the other dataframes, the location column contains coordinates that need to be extracted.
If we take a look closer at one of the entries, it appears to be a Python dictionary. With this, indexing and extracting the columns needed can be done easily.
location_df["Location"][0]
{'latitude': '40.722075',
'longitude': '-73.978747',
'human_address': '{"address": "333 EAST 4 STREET", "city": "MANHATTAN", "state": "NY", "zip": "10009"}'}
lat = []
lon = []
for idx, row in location_df.iterrows():
location = row['Location']
if type(location) is float:
lat.append(0)
lon.append(0)
else:
lat.append(location['latitude'])
lon.append(location['longitude'])
location_df['Latitude'] = lat
location_df['Longitude'] = lon
location_df = location_df.loc[location_df['Latitude'] != 0]
location_df
| District | DBN | School Name | Location | Latitude | Longitude | |
|---|---|---|---|---|---|---|
| 0 | 1 | 01M015 | P.S. 015 Roberto Clemente | {'latitude': '40.722075', 'longitude': '-73.97... | 40.722075 | -73.978747 |
| 1 | 1 | 01M019 | P.S. 019 Asher Levy | {'latitude': '40.730009', 'longitude': '-73.98... | 40.730009 | -73.984496 |
| 2 | 1 | 01M020 | P.S. 020 Anna Silver | {'latitude': '40.721305', 'longitude': '-73.98... | 40.721305 | -73.986312 |
| 3 | 1 | 01M034 | P.S. 034 Franklin D. Roosevelt | {'latitude': '40.726008', 'longitude': '-73.97... | 40.726008 | -73.975058 |
| 4 | 1 | 01M063 | The STAR Academy - P.S.63 | {'latitude': '40.72444', 'longitude': '-73.986... | 40.72444 | -73.986214 |
| ... | ... | ... | ... | ... | ... | ... |
| 1818 | 9 | 84X705 | Family Life Academy Charter School | {'latitude': '40.840262', 'longitude': '-73.91... | 40.840262 | -73.918665 |
| 1819 | 9 | 84X706 | Harriet Tubman Charter School | {'latitude': '40.832328', 'longitude': '-73.90... | 40.832328 | -73.905771 |
| 1820 | 9 | 84X717 | Icahn Charter School | {'latitude': '40.838654', 'longitude': '-73.90... | 40.838654 | -73.904703 |
| 1821 | 11 | 84X718 | Bronx Charter School for Better Learning | {'latitude': '40.884371', 'longitude': '-73.83... | 40.884371 | -73.839288 |
| 1822 | 8 | 84X730 | Bronx Charter School for the Arts | {'latitude': '40.821815', 'longitude': '-73.88... | 40.821815 | -73.886477 |
1822 rows × 6 columns
def groupByDistrictGrade(df, columns):
d = df.groupby('District')
data = {key: [] for key in columns}
for i in range(1, 33): # loop for districts 1-32
district = d.get_group(i)
grades = district['Grade'].unique()
grade = district.groupby('Grade')
for g in range(grades[0], grades[-1] + 1): # loop for grades 3-8
data['District'].append(i)
data['Grade'].append(g)
data['Number Tested'].append(grade.get_group(g)['Number Tested'].sum().astype(int))
for col in columns[3:]: # ignore District, Grade, and Number Tested columns
for g in range(grades[0], grades[-1] + 1): # loop for possible grades in district (not all districts are gr3-8)
data[col].append(float("{:.2f}".format(grade.get_group(g)[col].mean())))
df = pd.DataFrame.from_dict(data)
return df
columns = ['District', 'Grade', 'Number Tested', 'Mean Scale Score', '# Level 1', '% Level 1', '# Level 2',
'% Level 2', '# Level 3', '% Level 3', '# Level 4', '% Level 4', '# Level 3+4', '% Level 3+4']
ela_averages = groupByDistrictGrade(ela_df, columns)
ela_averages
| District | Grade | Number Tested | Mean Scale Score | # Level 1 | % Level 1 | # Level 2 | % Level 2 | # Level 3 | % Level 3 | # Level 4 | % Level 4 | # Level 3+4 | % Level 3+4 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | 4982 | 389.73 | 10.15 | 26.38 | 11.95 | 33.12 | 14.42 | 33.17 | 5.70 | 7.33 | 20.12 | 40.50 |
| 1 | 1 | 4 | 4908 | 389.32 | 8.30 | 22.56 | 13.42 | 37.17 | 10.34 | 25.65 | 9.18 | 14.62 | 19.52 | 40.27 |
| 2 | 1 | 5 | 4771 | 388.95 | 11.14 | 31.01 | 11.80 | 32.54 | 8.74 | 21.33 | 8.41 | 15.10 | 17.15 | 36.44 |
| 3 | 1 | 6 | 4529 | 388.70 | 14.98 | 26.02 | 22.62 | 35.44 | 11.98 | 15.96 | 22.30 | 22.60 | 34.29 | 38.55 |
| 4 | 1 | 7 | 4558 | 390.78 | 15.71 | 26.35 | 21.63 | 33.40 | 19.25 | 24.72 | 15.75 | 15.53 | 35.00 | 40.25 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 163 | 32 | 4 | 6443 | 375.99 | 27.29 | 37.79 | 27.35 | 39.12 | 12.10 | 17.34 | 4.07 | 5.75 | 16.16 | 23.09 |
| 164 | 32 | 5 | 6647 | 375.37 | 32.12 | 47.02 | 22.30 | 32.53 | 9.85 | 14.99 | 3.56 | 5.47 | 13.41 | 20.46 |
| 165 | 32 | 6 | 7362 | 373.36 | 38.06 | 38.92 | 39.55 | 35.93 | 16.23 | 13.62 | 12.86 | 11.52 | 29.09 | 25.14 |
| 166 | 32 | 7 | 7285 | 377.76 | 37.91 | 37.18 | 40.16 | 37.18 | 23.04 | 19.94 | 6.01 | 5.71 | 29.06 | 25.65 |
| 167 | 32 | 8 | 7252 | 383.21 | 29.80 | 27.16 | 42.83 | 40.17 | 26.08 | 22.81 | 11.17 | 9.84 | 37.24 | 32.67 |
168 rows × 14 columns
math_averages = groupByDistrictGrade(math_df, columns)
math_averages
| District | Grade | Number Tested | Mean Scale Score | # Level 1 | % Level 1 | # Level 2 | % Level 2 | # Level 3 | % Level 3 | # Level 4 | % Level 4 | # Level 3+4 | % Level 3+4 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | 4999 | 390.96 | 9.68 | 27.21 | 10.26 | 27.95 | 9.96 | 24.15 | 12.83 | 20.70 | 22.79 | 44.84 |
| 1 | 1 | 4 | 4935 | 388.56 | 10.21 | 28.83 | 10.39 | 29.29 | 8.36 | 21.06 | 12.50 | 20.83 | 20.87 | 41.89 |
| 2 | 1 | 5 | 4789 | 391.64 | 11.48 | 32.77 | 9.08 | 26.18 | 8.32 | 21.21 | 11.37 | 19.84 | 19.69 | 41.05 |
| 3 | 1 | 6 | 4558 | 392.57 | 16.00 | 27.72 | 18.90 | 29.10 | 13.16 | 17.61 | 24.29 | 25.57 | 37.44 | 43.20 |
| 4 | 1 | 7 | 4453 | 391.52 | 19.35 | 32.69 | 18.87 | 28.99 | 15.17 | 19.43 | 17.29 | 18.90 | 32.46 | 38.32 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 163 | 32 | 4 | 6586 | 374.76 | 28.58 | 39.83 | 25.34 | 35.35 | 12.32 | 16.61 | 6.13 | 8.20 | 18.45 | 24.81 |
| 164 | 32 | 5 | 6778 | 379.79 | 31.81 | 45.85 | 20.18 | 28.70 | 12.40 | 18.12 | 4.78 | 7.32 | 17.17 | 25.45 |
| 165 | 32 | 6 | 7503 | 372.38 | 43.06 | 42.23 | 39.51 | 33.99 | 17.22 | 14.78 | 8.96 | 9.01 | 26.17 | 23.79 |
| 166 | 32 | 7 | 7434 | 375.94 | 49.06 | 46.53 | 36.68 | 31.35 | 17.93 | 16.02 | 5.66 | 6.10 | 23.59 | 22.13 |
| 167 | 32 | 8 | 6534 | 369.97 | 49.66 | 45.67 | 42.41 | 38.45 | 13.81 | 12.20 | 4.86 | 3.68 | 18.68 | 15.87 |
168 rows × 14 columns
The percentage of students who received a satisfactory score (3 or 4) on the statewide exam is reflected in % Level 3+4 column. Let's take a look at the range of values for every grade in all districts.
def calculateRangeByGrade(df, col):
df = df.groupby('Grade')
data = {'Grade': [], 'Lowest': [], 'Highest': [], 'Range': []}
for i in range(3, 9):
highest = df.get_group(i)[col].max()
lowest = df.get_group(i)[col].min()
rnge = float("{:.2f}".format(highest-lowest))
data['Grade'].append(i)
data['Lowest'].append(lowest)
data['Highest'].append(highest)
data['Range'].append(rnge)
return pd.DataFrame.from_dict(data)
ela_ranges_bygrade = calculateRangeByGrade(ela_averages, '% Level 3+4')
ela_ranges_bygrade
| Grade | Lowest | Highest | Range | |
|---|---|---|---|---|
| 0 | 3 | 20.78 | 67.36 | 46.58 |
| 1 | 4 | 19.25 | 69.13 | 49.88 |
| 2 | 5 | 15.60 | 63.69 | 48.09 |
| 3 | 6 | 15.48 | 60.62 | 45.14 |
| 4 | 7 | 15.61 | 62.92 | 47.31 |
| 5 | 8 | 22.11 | 67.30 | 45.19 |
math_ranges_bygrade = calculateRangeByGrade(math_averages, '% Level 3+4')
math_ranges_bygrade
| Grade | Lowest | Highest | Range | |
|---|---|---|---|---|
| 0 | 3 | 21.39 | 75.27 | 53.88 |
| 1 | 4 | 16.18 | 75.28 | 59.10 |
| 2 | 5 | 15.15 | 73.85 | 58.70 |
| 3 | 6 | 14.33 | 68.84 | 54.51 |
| 4 | 7 | 12.97 | 68.96 | 55.99 |
| 5 | 8 | 9.62 | 52.45 | 42.83 |
In both the ELA and math averages, there is a high difference between districts. In the ELA data, one district had 20.78% of their third grade students from 2013-18 receive a satisfactory score of 3 or 4. Another district had 67.36% of their third grade students from 2013-18 receive a 3 or 4. This is nearly a 50% difference. In the math data, one district had 15.15% of their 5th grade students from 2013-18 receive of 3 or 4 and another had 73.85%. That is nearly a 60% difference.
With this information, it is evident there is a difference between districts. In this project, I explore how ethnicity possibly influences the averages. So, let's take a look at each districts' ethnic composition.
demographics_df.head()
| District | DBN | School Name | Year | Total Enrollment | Grade 3 | Grade 4 | Grade 5 | Grade 6 | Grade 7 | ... | # Asian | % Asian | # Black | % Black | # Hispanic | % Hispanic | # Other | % Other | # White | % White | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 01M015 | P.S. 015 Roberto Clemente | 2013-14 | 190 | 16 | 26 | 23 | 0 | 0 | ... | 9 | 4.7 | 72 | 37.9 | 104 | 54.7 | 2 | 1.1 | 3 | 1.6 |
| 1 | 1 | 01M015 | P.S. 015 Roberto Clemente | 2014-15 | 183 | 19 | 17 | 24 | 0 | 0 | ... | 8 | 4.4 | 65 | 35.5 | 107 | 58.5 | 1 | 0.5 | 2 | 1.1 |
| 2 | 1 | 01M015 | P.S. 015 Roberto Clemente | 2015-16 | 176 | 23 | 17 | 18 | 0 | 0 | ... | 9 | 5.1 | 57 | 32.4 | 105 | 59.7 | 3 | 1.7 | 2 | 1.1 |
| 3 | 1 | 01M015 | P.S. 015 Roberto Clemente | 2016-17 | 178 | 31 | 24 | 18 | 0 | 0 | ... | 14 | 7.9 | 51 | 28.7 | 105 | 59.0 | 4 | 2.2 | 4 | 2.2 |
| 4 | 1 | 01M015 | P.S. 015 Roberto Clemente | 2017-18 | 190 | 23 | 31 | 26 | 0 | 0 | ... | 20 | 10.5 | 52 | 27.4 | 110 | 57.9 | 2 | 1.1 | 6 | 3.2 |
5 rows × 21 columns
First, let's convert the datatypes of the appropriate columns to floats.
columns = list(demographics_df.columns)
demographics_df[columns[4:]] = demographics_df[columns[4:]].astype(float)
demographics_df.dtypes
District int64 DBN object School Name object Year object Total Enrollment float64 Grade 3 float64 Grade 4 float64 Grade 5 float64 Grade 6 float64 Grade 7 float64 Grade 8 float64 # Asian float64 % Asian float64 # Black float64 % Black float64 # Hispanic float64 % Hispanic float64 # Other float64 % Other float64 # White float64 % White float64 dtype: object
Now, let's average the demographics from 2013-18 for each district.
def avgDemographics(df):
districts = df.groupby('District')
columns = list(df.columns)
columns.remove('DBN')
columns.remove('School Name')
columns.remove('Year')
data = {key: [] for key in columns}
for i in range(1, 33): #Districts 1-32 (To match ELA and Math datasets)
district = districts.get_group(i)
data['District'].append(i)
data['Total Enrollment'].append(int(district['Total Enrollment'].sum()))
for col in columns[2:]:
data[col].append(float("{:.2f}".format(district[col].mean())))
return pd.DataFrame.from_dict(data)
avg_demographics = avgDemographics(demographics_df)
avg_demographics
| District | Total Enrollment | Grade 3 | Grade 4 | Grade 5 | Grade 6 | Grade 7 | Grade 8 | # Asian | % Asian | # Black | % Black | # Hispanic | % Hispanic | # Other | % Other | # White | % White | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 56935 | 29.67 | 29.52 | 29.00 | 27.23 | 26.96 | 27.92 | 84.95 | 15.72 | 65.06 | 19.69 | 161.71 | 48.40 | 11.34 | 2.40 | 69.60 | 13.80 |
| 1 | 2 | 309227 | 24.44 | 24.16 | 23.27 | 22.04 | 21.03 | 20.94 | 122.04 | 18.55 | 83.58 | 17.66 | 173.82 | 35.67 | 19.88 | 3.85 | 130.18 | 24.27 |
| 2 | 3 | 111386 | 31.78 | 31.24 | 30.66 | 31.96 | 31.32 | 30.87 | 40.11 | 5.85 | 114.92 | 29.32 | 167.87 | 38.07 | 21.30 | 3.51 | 155.29 | 23.25 |
| 3 | 4 | 65764 | 33.44 | 33.19 | 33.12 | 31.89 | 31.24 | 31.19 | 31.29 | 5.72 | 106.01 | 27.38 | 257.30 | 60.10 | 8.76 | 2.18 | 18.21 | 4.62 |
| 4 | 5 | 57508 | 29.19 | 29.14 | 26.95 | 26.56 | 27.46 | 28.79 | 14.79 | 3.46 | 196.57 | 50.71 | 158.82 | 39.35 | 8.68 | 2.22 | 17.74 | 4.26 |
| 5 | 6 | 112867 | 41.44 | 41.56 | 36.09 | 34.42 | 36.45 | 38.63 | 6.31 | 1.23 | 35.41 | 6.72 | 417.75 | 86.38 | 3.96 | 0.85 | 20.97 | 4.83 |
| 6 | 7 | 93490 | 31.25 | 30.50 | 30.24 | 27.29 | 28.18 | 27.92 | 5.35 | 1.11 | 121.17 | 26.85 | 321.45 | 69.82 | 4.20 | 0.92 | 6.12 | 1.31 |
| 7 | 8 | 137016 | 40.48 | 39.89 | 39.58 | 37.87 | 37.83 | 38.40 | 28.23 | 4.89 | 115.99 | 25.14 | 326.42 | 63.51 | 6.54 | 1.36 | 30.28 | 5.10 |
| 8 | 9 | 170923 | 42.11 | 40.26 | 37.98 | 38.59 | 38.90 | 39.19 | 7.03 | 1.48 | 144.73 | 29.26 | 337.32 | 67.38 | 3.77 | 0.77 | 5.46 | 1.12 |
| 9 | 10 | 271920 | 51.21 | 50.17 | 48.51 | 44.23 | 43.66 | 44.56 | 48.67 | 4.42 | 104.68 | 18.00 | 437.40 | 71.83 | 9.34 | 1.32 | 36.73 | 4.44 |
| 10 | 11 | 188833 | 53.46 | 52.92 | 52.07 | 46.37 | 46.74 | 46.65 | 46.49 | 6.50 | 246.85 | 41.67 | 256.35 | 42.02 | 11.09 | 1.97 | 50.33 | 7.84 |
| 11 | 12 | 114758 | 38.51 | 37.71 | 35.54 | 33.81 | 34.78 | 35.09 | 11.12 | 2.13 | 119.37 | 26.79 | 323.95 | 68.57 | 6.20 | 1.22 | 5.86 | 1.29 |
| 12 | 13 | 106627 | 27.93 | 26.47 | 24.42 | 17.66 | 17.98 | 18.56 | 101.96 | 6.72 | 245.11 | 63.19 | 82.71 | 19.73 | 15.63 | 2.51 | 69.70 | 7.85 |
| 13 | 14 | 94470 | 33.04 | 32.61 | 30.95 | 32.14 | 32.59 | 33.66 | 22.74 | 3.61 | 111.68 | 23.24 | 291.83 | 60.36 | 8.02 | 1.54 | 62.94 | 11.24 |
| 14 | 15 | 154114 | 62.43 | 59.78 | 56.40 | 40.31 | 38.17 | 36.81 | 101.14 | 11.86 | 97.09 | 20.38 | 250.85 | 38.31 | 18.51 | 2.99 | 177.25 | 26.46 |
| 15 | 16 | 33984 | 25.47 | 24.76 | 22.81 | 14.49 | 15.41 | 16.22 | 5.22 | 1.87 | 219.34 | 75.85 | 51.82 | 18.89 | 3.98 | 1.43 | 5.22 | 1.95 |
| 16 | 17 | 111605 | 31.59 | 30.72 | 29.04 | 31.99 | 33.77 | 35.07 | 12.49 | 2.69 | 354.11 | 77.63 | 68.56 | 14.88 | 9.49 | 2.17 | 10.89 | 2.64 |
| 17 | 18 | 78828 | 37.62 | 38.79 | 37.04 | 34.25 | 35.74 | 37.72 | 6.30 | 1.30 | 411.22 | 87.51 | 33.11 | 7.73 | 6.15 | 1.41 | 9.66 | 2.05 |
| 18 | 19 | 109363 | 39.11 | 39.29 | 37.37 | 34.39 | 33.24 | 31.68 | 33.00 | 5.37 | 215.62 | 50.55 | 196.08 | 40.93 | 7.58 | 1.55 | 7.23 | 1.60 |
| 19 | 20 | 249139 | 93.97 | 91.23 | 88.25 | 88.97 | 86.85 | 87.49 | 491.28 | 42.56 | 37.22 | 2.81 | 304.12 | 26.28 | 10.66 | 0.91 | 299.57 | 27.45 |
| 20 | 21 | 173386 | 54.28 | 53.30 | 50.74 | 72.27 | 71.01 | 70.34 | 216.91 | 22.87 | 135.14 | 17.10 | 216.73 | 27.65 | 13.15 | 1.37 | 272.18 | 31.01 |
| 21 | 22 | 169571 | 71.36 | 70.70 | 69.32 | 52.23 | 53.50 | 54.66 | 164.35 | 14.93 | 286.69 | 40.71 | 127.59 | 15.92 | 14.13 | 1.79 | 255.10 | 26.65 |
| 22 | 23 | 47580 | 26.16 | 25.70 | 22.81 | 31.86 | 31.61 | 32.92 | 3.33 | 1.09 | 248.46 | 76.38 | 68.14 | 19.87 | 4.61 | 1.50 | 3.60 | 1.16 |
| 23 | 24 | 288041 | 89.35 | 88.16 | 83.27 | 74.35 | 76.22 | 75.74 | 202.57 | 18.33 | 28.83 | 3.10 | 637.45 | 62.79 | 10.17 | 1.06 | 142.39 | 14.74 |
| 24 | 25 | 183058 | 63.29 | 61.84 | 59.93 | 59.13 | 57.69 | 57.44 | 385.32 | 46.95 | 67.73 | 8.88 | 234.19 | 27.29 | 11.29 | 1.53 | 107.89 | 15.35 |
| 25 | 26 | 158123 | 57.46 | 58.18 | 57.94 | 60.76 | 60.26 | 61.83 | 513.13 | 52.87 | 119.02 | 9.73 | 168.21 | 15.55 | 19.01 | 2.31 | 168.90 | 19.54 |
| 26 | 27 | 219362 | 58.72 | 57.57 | 57.53 | 58.29 | 59.04 | 59.42 | 164.28 | 18.19 | 172.81 | 29.98 | 270.37 | 36.67 | 32.22 | 4.49 | 72.53 | 10.67 |
| 27 | 28 | 199800 | 56.74 | 55.12 | 52.93 | 50.41 | 49.08 | 49.32 | 237.06 | 27.26 | 188.91 | 29.44 | 217.54 | 24.66 | 39.16 | 5.34 | 126.23 | 13.28 |
| 28 | 29 | 132051 | 52.93 | 54.12 | 53.47 | 48.82 | 49.38 | 50.69 | 87.32 | 10.81 | 371.52 | 72.09 | 84.45 | 12.27 | 20.47 | 3.24 | 10.38 | 1.59 |
| 29 | 30 | 197812 | 64.46 | 62.95 | 63.51 | 70.08 | 67.10 | 67.30 | 182.41 | 21.19 | 56.76 | 8.05 | 442.20 | 52.47 | 13.60 | 1.92 | 129.23 | 16.37 |
| 30 | 31 | 300101 | 64.98 | 64.31 | 63.93 | 59.89 | 59.28 | 59.82 | 74.99 | 8.09 | 109.78 | 14.25 | 231.88 | 29.88 | 15.29 | 2.07 | 411.03 | 45.72 |
| 31 | 32 | 61184 | 37.76 | 37.96 | 38.48 | 41.43 | 42.55 | 42.29 | 8.91 | 1.79 | 82.01 | 18.14 | 351.07 | 77.50 | 2.98 | 0.62 | 8.24 | 1.95 |
Let's observe the bottom 5 districts based on their mean scale score for 5th graders and explore if there exists a potential correlation between ethnicity and score. In the dataframe below, the lowest 5 scoring schools are shown. In four of the lowest scoring districts, the majority of the student population was greater than 65% Hispanic. In District 5, 50.71% of the student population was Black.
data = pd.merge(ela_averages, avg_demographics, on="District", how='left') # merge ela averages and demographics
data.groupby('Grade').get_group(5).sort_values(by='Mean Scale Score')[:5] # group by 5th grade and sort to get 5 lowest districts
| District | Grade | Number Tested | Mean Scale Score | # Level 1 | % Level 1 | # Level 2 | % Level 2 | # Level 3 | % Level 3 | ... | # Asian | % Asian | # Black | % Black | # Hispanic | % Hispanic | # Other | % Other | # White | % White | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 38 | 7 | 5 | 7800 | 375.03 | 35.01 | 50.85 | 21.95 | 32.38 | 8.63 | 13.20 | ... | 5.35 | 1.11 | 121.17 | 26.85 | 321.45 | 69.82 | 4.20 | 0.92 | 6.12 | 1.31 |
| 164 | 32 | 5 | 6647 | 375.37 | 32.12 | 47.02 | 22.30 | 32.53 | 9.85 | 14.99 | ... | 8.91 | 1.79 | 82.01 | 18.14 | 351.07 | 77.50 | 2.98 | 0.62 | 8.24 | 1.95 |
| 68 | 12 | 5 | 10679 | 375.81 | 44.21 | 53.94 | 24.41 | 30.46 | 9.47 | 11.98 | ... | 11.12 | 2.13 | 119.37 | 26.79 | 323.95 | 68.57 | 6.20 | 1.22 | 5.86 | 1.29 |
| 26 | 5 | 5 | 4931 | 377.59 | 22.35 | 47.49 | 14.99 | 32.96 | 6.18 | 14.36 | ... | 14.79 | 3.46 | 196.57 | 50.71 | 158.82 | 39.35 | 8.68 | 2.22 | 17.74 | 4.26 |
| 56 | 10 | 5 | 26605 | 377.68 | 44.28 | 44.29 | 32.20 | 33.66 | 15.25 | 15.89 | ... | 48.67 | 4.42 | 104.68 | 18.00 | 437.40 | 71.83 | 9.34 | 1.32 | 36.73 | 4.44 |
5 rows × 31 columns
Let's observe the top 5 districts based on their mean scale score for 5th graders and again explore if there exists a potential correlation between ethnicity and score. In the dataframe below, the highest 5 scoring schools are shown. District 26, 20, and 25 have majority Asian students (nearly 50%). District 2 shows a somewhat even distribution between all ethnicities. District 3 shows an even distribution between Black, Hispanic, and White students.
With the bottom and top 5 scoring districts, there is a correlation between mean scale score and ethnicity. However, the correlation is not strong as minority groups are capable of scoring high on the statewide tests.
# group by 5th grade and sort to get 5 lowest districts
data.groupby('Grade').get_group(5).sort_values(by='Mean Scale Score', ascending=False)[:5]
| District | Grade | Number Tested | Mean Scale Score | # Level 1 | % Level 1 | # Level 2 | % Level 2 | # Level 3 | % Level 3 | ... | # Asian | % Asian | # Black | % Black | # Hispanic | % Hispanic | # Other | % Other | # White | % White | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | 2 | 5 | 17497 | 416.15 | 9.23 | 12.37 | 18.32 | 23.94 | 24.58 | 30.78 | ... | 122.04 | 18.55 | 83.58 | 17.66 | 173.82 | 35.67 | 19.88 | 3.85 | 130.18 | 24.27 |
| 134 | 26 | 5 | 12278 | 406.48 | 9.40 | 11.26 | 22.36 | 27.03 | 27.78 | 33.73 | ... | 513.13 | 52.87 | 119.02 | 9.73 | 168.21 | 15.55 | 19.01 | 2.31 | 168.90 | 19.54 |
| 14 | 3 | 5 | 8461 | 400.78 | 12.39 | 26.21 | 15.25 | 25.41 | 15.49 | 21.95 | ... | 40.11 | 5.85 | 114.92 | 29.32 | 167.87 | 38.07 | 21.30 | 3.51 | 155.29 | 23.25 |
| 110 | 20 | 5 | 25318 | 399.86 | 31.82 | 24.27 | 39.04 | 30.78 | 33.32 | 26.54 | ... | 491.28 | 42.56 | 37.22 | 2.81 | 304.12 | 26.28 | 10.66 | 0.91 | 299.57 | 27.45 |
| 128 | 25 | 5 | 17846 | 398.40 | 22.28 | 20.35 | 34.09 | 31.34 | 31.81 | 28.99 | ... | 385.32 | 46.95 | 67.73 | 8.88 | 234.19 | 27.29 | 11.29 | 1.53 | 107.89 | 15.35 |
5 rows × 31 columns
Let's confirm if there exists a correlation between ethnicity and score using the Pearson correlation ranker.
indexes = [2,3,5,7,9,11,13,22,24,26,28,30] #limit the columns (remove district and grade and # values)
corrdf = data[data.columns[indexes]]
corrdf.corr().style.background_gradient(cmap="Blues")
| Number Tested | Mean Scale Score | % Level 1 | % Level 2 | % Level 3 | % Level 4 | % Level 3+4 | % Asian | % Black | % Hispanic | % Other | % White | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Number Tested | 1.000000 | 0.368846 | -0.343676 | -0.202475 | 0.320901 | 0.251663 | 0.343875 | 0.386292 | -0.468802 | -0.029436 | 0.132512 | 0.528768 |
| Mean Scale Score | 0.368846 | 1.000000 | -0.890666 | -0.692849 | 0.768800 | 0.803148 | 0.949262 | 0.680190 | -0.451261 | -0.352591 | 0.482451 | 0.708713 |
| % Level 1 | -0.343676 | -0.890666 | 1.000000 | 0.445569 | -0.772023 | -0.802885 | -0.950980 | -0.715020 | 0.379717 | 0.432920 | -0.459034 | -0.672210 |
| % Level 2 | -0.202475 | -0.692849 | 0.445569 | 1.000000 | -0.574686 | -0.586043 | -0.700584 | -0.525527 | 0.421374 | 0.182312 | -0.298379 | -0.533380 |
| % Level 3 | 0.320901 | 0.768800 | -0.772023 | -0.574686 | 1.000000 | 0.372491 | 0.813911 | 0.590863 | -0.319065 | -0.342174 | 0.332861 | 0.540944 |
| % Level 4 | 0.251663 | 0.803148 | -0.802885 | -0.586043 | 0.372491 | 1.000000 | 0.842353 | 0.652223 | -0.419971 | -0.334376 | 0.440368 | 0.648343 |
| % Level 3+4 | 0.343875 | 0.949262 | -0.950980 | -0.700584 | 0.813911 | 0.842353 | 1.000000 | 0.751451 | -0.448200 | -0.408059 | 0.468960 | 0.720042 |
| % Asian | 0.386292 | 0.680190 | -0.715020 | -0.525527 | 0.590863 | 0.652223 | 0.751451 | 1.000000 | -0.476565 | -0.408798 | 0.161347 | 0.499232 |
| % Black | -0.468802 | -0.451261 | 0.379717 | 0.421374 | -0.319065 | -0.419971 | -0.448200 | -0.476565 | 1.000000 | -0.505113 | 0.086237 | -0.519067 |
| % Hispanic | -0.029436 | -0.352591 | 0.432920 | 0.182312 | -0.342174 | -0.334376 | -0.408059 | -0.408798 | -0.505113 | 1.000000 | -0.408449 | -0.307251 |
| % Other | 0.132512 | 0.482451 | -0.459034 | -0.298379 | 0.332861 | 0.440368 | 0.468960 | 0.161347 | 0.086237 | -0.408449 | 1.000000 | 0.308263 |
| % White | 0.528768 | 0.708713 | -0.672210 | -0.533380 | 0.540944 | 0.648343 | 0.720042 | 0.499232 | -0.519067 | -0.307251 | 0.308263 | 1.000000 |
Let's take a close look at the correlation between ethnicity and percentage of students in the district who received a satisfactory grade of 3 or 4.
% Asian vs % Level 3+4
There is a strong positive correlation of 0.751451 between the percentage of Asian students and students with 3 or 4. This indicates that as the percentage of Asian students increases, the percentage of students who receive 3 or 4 also increases. This means the population of Asian students possibly influences the percentage of students who receive a satisfactory score. This correlation suggests that a higher percenteage of Asian students results in more students who receive a 3 or 4.
% Black vs % Level 3+4
There is a weak negative correlation of -0.448200 between the percentage of Black students and students with 3 or 4. This indicates that as the percentage of Black students increases, the percentage of students who receive 3 or 4 decreases. This indicates a possible influence of ethnicity on the percentage of students who receive a satisfactory score. This correlation suggests that a higher percentage of Black students results in less students who receive 3 or 4.
% Asian vs % Level 3+4
def scatterPlot(col1, col2):
plt.scatter(corrdf[col1], corrdf[col2])
plt.xlabel(col1)
plt.ylabel(col2)
scatterPlot('% Asian', '% Level 3+4')
% Black vs % Level 3+4
scatterPlot('% Black', '% Level 3+4')
% White vs % Level 3+4
scatterPlot('% White', '% Level 3+4')
% Hispanic vs % Level 3+4
scatterPlot('% Hispanic', '% Level 3+4')
These scatter plots suggest that for minority groups, the schools with higher populations of minority students, the lower the average statewide test score. While this may not be directly influenced by the ethnicity, there is a suggestion that there exists this influence.
school_map = folium.Map(location=[40.69022469799167, -73.9871222729115], zoom_start=10)
folium.Choropleth(
geo_data='School Districts.geojson',
name='School Districts',
fill_opacity=0.3,
line_opacity=1
).add_to(school_map)
school_map
data['District'] = data['District'].astype(str)
score_map = folium.Map(location=[40.69022469799167, -73.9871222729115], zoom_start=10)
folium.Choropleth(
geo_data='School Districts.geojson',
fill_opacity=0.75,
line_opacity=1,
threshold_scale = list(range(100, 500, 50)),
data = data,
key_on='feature.properties.school_dist',
columns = ['District', 'Mean Scale Score']
).add_to(score_map)
<folium.features.Choropleth at 0x7fd829389f70>